Finite Dimensional Infinite Constellations
In the setting of a Gaussian channel without power constraints, proposed by
Poltyrev, the codewords are points in an n-dimensional Euclidean space (an
infinite constellation) and the tradeoff between their density and the error
probability is considered. The capacity in this setting is the highest
achievable normalized log density (NLD) with vanishing error probability. This
capacity as well as error exponent bounds for this setting are known. In this
work we consider the optimal performance achievable in the fixed blocklength
(dimension) regime. We provide two new achievability bounds, and extend the
validity of the sphere bound to finite dimensional infinite constellations. We
also provide asymptotic analysis of the bounds: When the NLD is fixed, we
provide asymptotic expansions for the bounds that are significantly tighter
than the previously known error exponent results. When the error probability is
fixed, we show that as n grows, the gap to capacity is, up to first order,
inversely proportional to the square root of n, with the proportionality
constant given by the inverse Q-function of the allowed error probability times
the square root of 1/2. In analogy to the corresponding result in channel
coding, the dispersion of infinite constellations is 1/2 nat^2 per channel use. All our
achievability results use lattices and therefore hold for the maximal error
probability as well. Connections to the error exponent of the power constrained
Gaussian channel and to the volume-to-noise ratio as a figure of merit are
discussed. In addition, we demonstrate the tightness of the results numerically
and compare to state-of-the-art coding schemes.
Comment: 54 pages, 13 figures. Submitted to IEEE Transactions on Information Theory.
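In symbols, the fixed-error-probability result above reads as follows; the notation δ* (for the Poltyrev capacity) and δ(n, ε) (for the optimal NLD at dimension n and error probability ε) is introduced here for illustration and may differ from the paper's:

```latex
% Fixed error probability \varepsilon, growing dimension n: the gap
% between the Poltyrev capacity \delta^\ast and the best achievable
% NLD \delta(n,\varepsilon) shrinks like 1/\sqrt{n}.
\[
  \delta^\ast - \delta(n,\varepsilon)
    \;=\; \sqrt{\tfrac{1}{2n}}\; Q^{-1}(\varepsilon)
    \;+\; o\!\left(n^{-1/2}\right)
\]
% Equivalently, the dispersion of infinite constellations is
% 1/2 nat^2 per channel use, mirroring channel-coding dispersion.
```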
Action Recognition by Hierarchical Mid-level Action Elements
Realistic videos of human actions exhibit rich spatiotemporal structures at
multiple levels of granularity: an action can always be decomposed into
multiple finer-grained elements in both space and time. To capture this
intuition, we propose to represent videos by a hierarchy of mid-level action
elements (MAEs), where each MAE corresponds to an action-related spatiotemporal
segment in the video. We introduce an unsupervised method to generate this
representation from videos. Our method is capable of distinguishing
action-related segments from background segments and representing actions at
multiple spatiotemporal resolutions. Given a set of spatiotemporal segments
generated from the training data, we introduce a discriminative clustering
algorithm that automatically discovers MAEs at multiple levels of granularity.
We develop structured models that capture a rich set of spatial, temporal and
hierarchical relations among the segments, where the action label and multiple
levels of MAE labels are jointly inferred. The proposed model achieves
state-of-the-art performance on multiple action recognition benchmarks.
Moreover, we demonstrate the effectiveness of our model in real-world
applications such as action recognition in large-scale untrimmed videos and
action parsing.
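To make the discriminative clustering step concrete, here is a minimal sketch that alternates cluster assignment with training one-vs-rest linear classifiers, so each cluster is defined by a discriminative boundary rather than raw feature distance. The use of scikit-learn, the feature dimension, and the fixed iteration count are assumptions for illustration, not the authors' implementation:

```python
# Illustrative discriminative clustering of spatiotemporal segment
# features (hypothetical, not the authors' code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def discriminative_clustering(X, n_clusters=10, n_iters=5):
    # Initialize with k-means, then refine: train one-vs-rest linear
    # classifiers on the current assignment and reassign each segment
    # to the cluster whose classifier scores it highest.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        labels = clf.predict(X)
    return labels, clf

# One feature vector per candidate spatiotemporal segment.
X = np.random.randn(500, 128)
labels, clf = discriminative_clustering(X)
```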
Semantic Cross-View Matching
Matching cross-view images is challenging because the appearance and
viewpoints are significantly different. While low-level features based on
gradient orientations or filter responses vary drastically with such changes in
viewpoint, the semantic content of an image is largely invariant to them.
Consequently, semantically labeled regions can be used for cross-view matching.
In this paper, we therefore explore
this idea and propose an automatic method for detecting and representing the
semantic information of an RGB image with the goal of performing cross-view
matching with a (non-RGB) geographic information system (GIS). A segmented
image forms the input to our system with segments assigned to semantic concepts
such as traffic signs, lakes, roads, foliage, etc. We design a descriptor that
robustly captures both the presence of semantic concepts and the spatial layout
of those segments. Pairwise distances between the descriptors extracted from
the GIS map and the query image are then used to generate a shortlist of the
most promising locations with similar semantic concepts in a consistent spatial
layout. An experimental evaluation with challenging query images and a large
urban area shows promising results.
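One way to realize such a descriptor, sketched below, is a spatial grid of per-cell histograms over semantic concepts, so the vector encodes both which concepts are present and where they lie, with Euclidean distance ranking GIS locations. The grid size, concept set, and distance choice are assumptions for illustration, not necessarily the paper's exact design:

```python
# Hypothetical semantic-layout descriptor and shortlist ranking.
import numpy as np

N_CONCEPTS = 8  # e.g. road, foliage, lake, traffic sign, ... (assumed)
GRID = 4        # 4x4 spatial layout (assumed)

def semantic_descriptor(label_map):
    """label_map: (H, W) integer map of per-pixel semantic concept ids."""
    H, W = label_map.shape
    desc = np.zeros((GRID, GRID, N_CONCEPTS))
    for i in range(GRID):
        for j in range(GRID):
            cell = label_map[i * H // GRID:(i + 1) * H // GRID,
                             j * W // GRID:(j + 1) * W // GRID]
            hist = np.bincount(cell.ravel(), minlength=N_CONCEPTS)
            desc[i, j] = hist / max(cell.size, 1)  # per-cell frequencies
    return desc.ravel()

def shortlist(query_desc, map_descs, k=10):
    """Rank candidate GIS locations by pairwise descriptor distance."""
    d = np.linalg.norm(map_descs - query_desc, axis=1)
    return np.argsort(d)[:k]

# Usage: one descriptor per GIS location, one for the query image.
gis = np.stack([semantic_descriptor(np.random.randint(0, N_CONCEPTS, (240, 320)))
                for _ in range(100)])
query = semantic_descriptor(np.random.randint(0, N_CONCEPTS, (240, 320)))
top10 = shortlist(query, gis)
```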
PALMER: Perception-Action Loop with Memory for Long-Horizon Planning
To achieve autonomy in a priori unknown real-world scenarios, agents should
be able to: i) act from high-dimensional sensory observations (e.g., images),
ii) learn from past experience to adapt and improve, and iii) be capable of
long-horizon planning. Classical planning algorithms (e.g., PRM, RRT) are
proficient at handling long-horizon planning. Deep learning-based methods, in
turn, can provide the representations needed to address the other two
requirements by modeling statistical contingencies between observations. In this direction, we
introduce a general-purpose planning algorithm called PALMER that combines
classical sampling-based planning algorithms with learning-based perceptual
representations. For training these perceptual representations, we combine
Q-learning with contrastive representation learning to create a latent space
where the distance between the embeddings of two states captures how easily an
optimal policy can traverse between them. For planning with these perceptual
representations, we re-purpose classical sampling-based planning algorithms to
retrieve previously observed trajectory segments from a replay buffer and
restitch them into approximately optimal paths that connect any given pair of
start and goal states. This creates a tight feedback loop between
representation learning, memory, reinforcement learning, and sampling-based
planning. The end result is an experiential framework for long-horizon planning
that is significantly more robust and sample-efficient than existing methods.
Comment: Website: https://palmer.epfl.ch
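A heavily simplified sketch of the retrieve-and-restitch idea: replay-buffer states become graph nodes, pairs whose latent distance falls below a threshold are treated as directly traversable, and a shortest-path search restitches stored experience into a start-to-goal path. Here encode() and the threshold eps stand in for PALMER's learned latent space and its connectivity criterion; both, and the all-pairs graph construction, are assumptions for illustration:

```python
# Retrieve-and-restitch planning over a replay buffer (illustrative).
import heapq
import numpy as np

def plan(buffer_states, encode, start, goal, eps=1.0):
    # Embed all buffered states plus the start and goal states.
    Z = np.stack([encode(s) for s in list(buffer_states) + [start, goal]])
    n = len(Z)
    D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)  # latent distances
    # Connect only pairs the policy can plausibly traverse directly.
    adj = [[(j, D[i, j]) for j in range(n) if 0.0 < D[i, j] < eps]
           for i in range(n)]
    src, dst = n - 2, n - 1
    dist, prev = [np.inf] * n, [None] * n
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:  # Dijkstra shortest path in latent-distance cost
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v], prev[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    if dist[dst] == np.inf:
        return None  # goal not reachable through stored experience
    path, u = [], dst
    while u is not None:  # walk predecessors back to the start
        path.append(u)
        u = prev[u]
    return path[::-1]  # indices into the stitched state list
```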